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Abstract 

Backgroud: Taking the advan tage of high-throughput single nucleotide polymorphism (SNP) genotyping 
technology, large genome-wide association studies (GWASs) have been considered to hold promise for unravelling 
complex relationships between genotype and phenotype. At present, traditional single-locus-based methods are 
insufficient to detect interactions consisting of multiple-locus, which are broadly existing in complex traits. In addition, 
statistic tests for high order epistatic interactions with more than 2 SNPs propose computational and analytical 
challenges because the computation increases exponentially as the cardinality of SNPs combinations gets larger. 

Results: In this paper, we provide a simple, fast and powerful method using dynamic clustering and cloud 
computing to detect genome-wide multi-locus epistatic interactions. We have constructed systematic experiments to 
compare powers performance against some recently proposed algorithms, including TEAM, SNPRuler, EDCF and 
BOOST. Furthermore, we have applied our method on two real GWAS datasets. Age-related macular degeneration 
(AMD) and Rheumatoid arthritis (RA) datasets, where we find some novel potential disease-related genetic factors 
which are not shown up in detections of 2-loci epistatic interactions. 

Conclusions: Experimental results on simulated data demonstrate that our method is more powerful than some 
recently proposed methods on both two- and three-locus disease models. Our method has discovered many novel 
high-order associations that are significantly enriched in cases from two real GWAS datasets. Moreover, the running 
time of the cloud implementation for our method on AMD dataset and RA dataset are roughly 2 hours and 50 hours 
on a cluster with forty small virtual machines for detecting two-locus interactions, respectively. Therefore, we believe 
that our method is suitable and effective for the full-scale analysis of multiple-locus epistatic interactions in GWAS. 
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Background 

Genome-wide association study (GWAS) has been proved 
to be a powerful genomic and statistical inference tool, 
and its goal is to identify genetic susceptibility through 
statistical tests on associations between a trait of inter- 
ests and the genetic information of unrelated individuals 
[1]. In genetics, genotype-phenotype association studies 
have established that single nucleotide polymorphisms 
(SNPs) [2], one type of genetic variants, are associated 
with a variety of diseases [3]. However, the primary 
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analysis paradigm for GWAS is dominated by the analy- 
sis on susceptibility of individual SNPs, which accordingly 
can only explain a small part of genetic causal effects 
for complex diseases [4]. For better understanding under- 
lying causes of complex disease traits, identifying joint 
genetic effects (epistasis) across the whole genome has 
attracted more attentions [5]. As a matter of fact, sin- 
gle locus-based approaches are insufficient to detect all 
interacting genes, especially for those with small marginal 
effects. The term epistasis was first used in 1909 and it 
was referred as an extension of the concept of dominance 
for alleles within the same allelomorphic pair [6] . In recent 
literatures, epistasis has been defined generally as the 
interaction among different genes [7]. Many studies [8-11] 
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have demonstrated that the epistasis is an important con- 
tributor to genetic variation in complex diseases such as 
asthma, breast cancer [12], diabetes, coronary heart dis- 
ease [13], and obesity [14]. In this article, we consider 
epistatic interactions as the statistically significant associ- 
ations of ^-SNP modules {t > 2) with phenotypes [15], i.e. 
the full association in terms of logistic regression. 

Recently, the problem of detecting high-order genome- 
wide epistatic interaction in GWAS has attracted exten- 
sive research interests. There are two challenges in finding 
high-order genome-wide epistatic interaction on large 
GWAS dataset [16]: The first arises from heavy com- 
putational burden, i.e. the number of association pat- 
terns increases exponentially as the order of interaction 
goes up. For example, 1.25 x 10^^ statistical tests are 
required to detect pairwise interactions for a dataset 
with 500,000 SNPs. The second challenge is that existing 
approaches lack statistical powers for searching high- 
order multi-locus models of disease. Many computational 
algorithms have been proposed to overcome two preced- 
ing difficulties. They can be broadly categorized to three 
groups: exhaustive search, stepwise search and heuristics 
approaches. 

The naive solution to tack the problem is exhaustive 
search using test, exact likelihood ratio test or entropy- 
based test for all modules of multiple-locus. Marchini 
et al. [5] showed that it was computationally possible to 
test two-locus associations allowing for interactions in 
GWAS based on current computation resources. Exam- 
ples in exhaustive search, like MDR and its extensions, 
utilize repeated cross-validations and permutation tests to 
evaluate accuracy and significance of classification [7,17]. 
In addition. Wan et al. [18] proposed a boolean operation- 
based representation to speed up the collection of con- 
tingency tables [19,20]. One major barrier for exhaustive 
search is the intensive computation, and thus parallel 
computing was adopted to further speed up the analy- 
sis of gene-gene interactions. For example, GBOOST [21] 
is a GPU framework based implementation of BOOST, 
and PIAM [22] is developed by Liu et al, which used the 
multi-thread to perform Genome-wide interaction-based 
association (GWIBA) analysis for exhaustive two-locus 
searches. However, finding higher order (more than 2 
loci) disease-related associations are too computation- 
ally expensive to be feasible, especially for large GWAS 
datasets with millions SNPs. In order to deal with the huge 
computation request, stepwise search strategies select a 
subset of SNPs or combinations of SNPs based on some 
low-order statistic tests (or measures), then extend them 
to higher order multi-locus interactions if it is statisti- 
cally possible [5,20,23]. Stepwise approaches are much 
faster than exhaustive algorithms and make high-order 
genome-wide epistasis finding feasible, but they lose pow- 
ers when complex diseases show no or little marginal 



effects. Unlike the previous two strategies, heuristic meth- 
ods adopt machine learning or stochastic procedures to 
search the space of interactions rather than explicitly 
enumerating all combinations of SNPs. SNPruler [24], 
BEAM [25], epiMODE [26] and epiForest [27] fall into 
this category. SNPRuler and a few other pattern-based 
methods use some data mining approaches as filters to 
reduce the number of SNP combinations without assump- 
tions of models. Based on the Markov chain Monte Carlo 
(MCMC) theory, BEAM iteratively calculates the poste- 
rior probability that a locus is associated with the disease 
and/or involved with other loci in epistasis interactions. 
EpiMODE first uses the Gibbs sampling strategy with a 
reversible jump Markov chain Monte Carlo procedure to 
simulate the posterior distribution that genetic variants 
belong to the epistatic modules and screens out statisti- 
cally significant modules based on hypothesis testing. Epi- 
Forest treats SNP markers as categorical features, adopts 
the random forest to discriminate cases against controls, 
and selects a small set of candidate SNPs that could min- 
imize the classification error. An drawback of heuristics 
approaches is that they will leave out a great deal of sig- 
nificant interactions which can be reported by first two 
searching strategies. 

In this paper, we provide a cloud based computational 
method, named "Dynamic Clustering for High-order 
genome-wide Epistatic interactions detecting" (DCHE), 
to address above challenges. Taking advantages of recent 
high-performance computing (HPC) technologies - cloud 
computing to accelerate computations, DCHE adopts an 
elaborated dynamic clustering procedure to maximize 
statistic significance for SNP combinations and ranks 
top ones as results. One benefit of cloud computing 
technologies is that the executional environment and 
experimental conditions can be easily and completely cus- 
tomized by newbies in distributed computing, even for 
large distributed infrastructures [28]. Furthermore, since 
the infrastructure is rented on a pay-per-use rule, imme- 
diate access to required resources for scientific exper- 
iments become possible without planning beforehand. 
With cloud computing, DCHE conducts statistic tests 
on merged groups of genotype categories determined by 
the dynamic clustering. Each grouped genotype category 
tends to share a similar effect associating with corre- 
sponding phenotypes. Truly disease-related joint genetic 
effects will gain higher ranking values, if genotype com- 
binations can be correctly clustered together. Systematic 
experiments on simulated two- and three-locus disease 
models datasets show that DCHE is more powerful in 
finding epistatic interactions than some recently proposed 
methods including TEAM [29], SNPRuler, BOOST and 
EDCF [20]. Our experiments on two real genome- wide 
case/control datasets. Age-related macular degeneration 
(AMD) and Rheumatoid arthritis (RA) demonstrate that 
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DCHE is feasible for the full-scale analyses of multi-locus 
associations on large GWAS datasets and it enriches a 
great deal of novel, significant high-order epistatic inter- 
actions which have not been reported in literatures. 

Results and discussion 

We first give definitions of 6 simulated disease models 
and the power metric used to evaluate the effective- 
ness of DCHE in comparison with other 4 popular 
epistatic interactions detecting methods, i.e. TEAM [29], 
SNPRuler [24], EDCF [20], BOOST [18]. Three reasons 
for choosing above 4 approaches are as follows: (1) TEAM, 
EDCF and BOOST all use the exhaustive search strategy 
for detecting two-locus interactions, so the comparison of 
their performance is fair; (2) a recent review tested five 
available methods and recommended BOOST and TEAM 
as a powerful tool for searching epistatic interactions on 
a genome-wide scale [15]; (3) our goal is to discover high- 
order epistatic interactions from GWAS data, and among 
4 detectors excluding DCHE, only SNPRuler and EDCF 
are equipped the ability to search interactions with more 
than 2 SNPs. Before experiments on simulated datasets, a 
discussion on how to control the false positive rate is illus- 
trated because the Bonferroni correction, most common 
method for controlling error rate, can be too conser- 
vative to filter significant interactions. We also present 
results of DCHE on two real GWAS dataset. Age-related 
macular degeneration (AMD) and Rheumatoid Arthri- 
tis (RA). Interactions detected by DCHE from different 
orders demonstrate a great number of novel, potentially 
disease-related genetic factors. At the end, a systematic 
performance evaluation of DCHE s cloud implementation 
is conducted on a standard Windows Azure cloud cluster 
with up to 40 small size Virtual Machine (VM) instances. 
The speed-up achieved by DCHE shows an approximately 
theoretical acceleration when the cardinality of epistatic 
interaction increases. 

Experimental design 
Data simulation 

To evaluate the effectiveness of DCHE, we perform exten- 
sive simulation experiments using six disease models with 
two- and three-locus associations. The unassociated SNP 
genotypes is generated by the same procedure used in pre- 
vious studies [18]. Minor allele frequencies (MAFs) are 
uniformly sampled from the set [0.05,0.5]. By assuming 
Hardy- Weinberg equilibrium, we can sample the genotype 
^ for individual For embedded disease models, 4 two- 
locus epistasis models and 2 three-locus epistasis models 
are chosen by given odds tables or penetrance table which 
can be found in Additional file 1: Tables SI -S3, and named 
these six models from model 1 to 6. In addition, we 
conduct tests on 50 two-locus epistasis models without 
marginal effects as BOOST and EDCF did in [18,20]. For 



models 1 to 4 and 50 models without marginal effect, each 
simulated dataset contained M = 1000 SNPs and N = 
800 or 1600 with balanced samples in case and control 
under each parameter setting. For model 5, one dataset 
has 1000 SNPs and 2000 or 4000 samples with = N^. 
For model 6, M = 2000 and N is reduced to 400 and 800 
with balanced cases and controls. 

A disease model can be defined either by specifying the 
penetrance table or the odds table. Relations among pen- 
etrance p (D), odds ODDg. and the probability p {D\gi) 
that an individual will be affected with a given genotype 
combination^/ can be calculated as Equation 1, 2. 



p{D\gi) p{D\gi) 



' p{D\gi) l-p{D\gi) 

ODDn, 



+ ODDg, 



(1) 



(2) 



Following [18], the disease prevalence p(P) and genetic 
heritability are given by Equation 3, 4. 



p{D) = Y,P{D\gi)p{gi) 



,o_i:i{p{D\g^) 



-p(D)fp{gi) 



p(D) (l-p(D)) 



(3) 



(4) 



For simplicity, we adopt same parameters as used in [18] 
for model 1 to 4, i.e./? (D) = 0.1, /z^ = 0.03 for model 1 and 
= 0.02 for Models 2, 3 and 4, MAF e {0.1, 0.2, 0.4}. For 
model 5, we adopt similar setting in [20], le, p (D) = 0.1, 
effect size X = 0.2, p g {4,1.5,1,0.7,0.5} and MAF e 
{0.1, 0.2, 0.3, 0.4, 0.5}. For model 6, MAFs of disease asso- 
ciated loci are fixed to 0.5. Effect parameters a and 0 
in odds tables for all six models are determined numeri- 
cally using same procedures in [25]. Settings for 50 models 
without marginal effect are similar to [30], i.e. ranges 
from 0.05 to 0.4 with five intervals and MAF equals to 0.2 
or 0.4. 

Statistical power 

In the comparison of performances on simulated data, 100 
datasets are generated for each setting. In one dataset, we 
embed one ground-truth epistatic interaction. The mea- 
sure of discrimination power used in [18] is adopted, 
which is defined as the fraction of 100 datasets on which 
only top interaction given by the method matches the 
ground-truth. For all programs, the ground-truth interac- 
tion are detected if it is set to the most significant one and 
its adjusted p-value is larger than the critical value which 
is setted to 0.1 in following experiments. 

Experimental setting 

Programs, TEAM, SNPRuler, BOOST (64 bit) and 
EDCF are downloaded from websites provided by their 
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authors. For experiments on simulations, DCHE (in Java), 
SNPRuler (in Java) and BOOST are conducted on a 64-bit 
Windows 8 platform with 1.8 GHz Intel CPU and 4 GB 
RAM; since TEAM and EDCF only provided executable 
program on Linux platform, experiments for TEAM and 
EDCF are conducted on a 64-bit Linux platform with 2.3 
GHz AMD CPU and 16 GB RAM. Experiments on two 
real datasets are performed on Windows Azure platform 
with up to 40 small size VMs. 

False positive rate 

Since DCHE uses stepwise strategy similar to EDCF, we 
also adopt two levels of multiple comparisons: (1) test 
(^) combinations for t loci for a dataset with M SNPs; 
(2) test dynamic clustering results, which could end with 
up to 3^ possible genotype combinations with d groups, 
d G {3, 4, 5, 6}. If we use the Bonferroni correction for 
above two level multiple tests, the upper bound of pos- 
sible ways to do combination is (^)6^^ Hence, it is too 
conservative to obtain significant interaction modules. In 
order to reasonably loose the strictness, inspired by EDCF 
we combine the Bonferroni correction and permutation 
tests for these two levels, that is Bonferroni corrections 
for t loci combinations and permutation tests for the 
dynamic clustering procedure. More specifically, the sig- 
nificant level for an epistatic interaction is calculated as 
Equation 5. 

oi = oio/i ^ I (5) 

In Equation 5, ao is estimated from permutation tests 
for different ts on null simulations and (^) represents 
the Bonferroni correction. To properly control the false 
positive rate, we simulated datasets with five different set- 
tings for each i.e. we either fix M = 1000 and set 
N to 400, 800 and 1600 or fix N = 800 and set M to 
1000, 2000 and 4000. Note that one thousand datasets 
are generated under one setting. The false positive rate 
is defined as fy^/^^/lOOO, where rifaise is the number of 
datasets where DCHE has found one or more interaction 
modules. Test results shown in Figure 1 illustrate: for a 
general setting of critical level 0.1, a recommended ao is 
1.5 X 10~^ for two-locus disease model detection, 1.2 x 
10~^ for three-locus disease model detection and 1.0 x 
10"-^^ for four-locus disease model detection. In addi- 
tion, the false positive rates tend to decrease or remain 
nearly unchanged as the number of samples and SNPs go 
up (Figure IB and IC). Therefore, in tests of simulated 
datasets and two real GWAS datasets, we set ao = 1.5 x 
10"^, 1.2 X 10"^, 1.0 X 10"2i for t = 2,3,4, respectively, 
to control the overall false positive rate for DCHE, unless 
otherwise stated. 



Two-locus disease models 

For a fair comparison, interactions reported by all pro- 
grams are filtered using the critical value 0.1 as the sig- 
nificant threshold. Test results are illustrated in Figure 2 
for model 1 to 4. An common trend for all programs is 
that power is increasing as sample size increases from 800 
to 1600. Most methods show more power when ground- 
truth model interactions' MAFs are larger, except that 
BOOST shows less power on model 1 and 2 when MAFs 
goes up. We can see that DCHE achieves highest or 
comparable powers on almost all datasets. More specif- 
ically, with 24 parameter settings for four disease mod- 
els, DCHE outperforms other four methods at 9 settings 
and obtains full powers at 10 settings and gains compa- 
rable results at 5 settings. Taking results from datasets 
with N = 1600 for example, it is obvious that DCHE 
defeats other approaches with nearly 100% powers. For 
a more straight comparison, we introduce a new con- 
cept, the overall quality q = n correct / ntotab where 

^correct 

is the number of datasets where programs successfully 
detect the ground-truth interactions and ritotal is the total 
number of datasets. When N = 800, M = 1000, the 
overall quality for DCHE, TEAM, SNPRuler, EDCF and 
BOOST are 0.541, 0.455, 0.087, 0.508 and 0.31, respec- 
tively. When N = 1600, M = 1000, all five programs 
achieve higher accuracies than former settings and q are 
0.981, 0.912, 0.162, 0.944 and 0.681, respectively. Note that 
DCHE, TEAM and EDCF have abilities to achieve more 
than 90% powers, and powers for DCHE reach to at least 
98% on datasets with 1600 samples. Note that BOOST is 
designed to identify significant statistical interaction with- 
out considering the main effects, so it is reasonable that 
our method DCHE and other two methods, i.e. EDCF 
and TEAM, outperform BOOST for detecting the model 
1 through 4. The reason why we still put the BOOST 
into the experiments is that the biologists might be more 
interested in epistatic interaction as long as it shows sig- 
nificant association genotypes with phenotypes. In addi- 
tion, similar designs of experiments can be found in other 
literatures [15,18,20]. 

Moreover, we conduct tests on 50 disease models with 
little marginal effects. For convenience, penetrance tables 
for 50 models are not listed, and they are available in 
literature [30]. Since most methods gain near full pow- 
ers, we use box plots to demonstrate overall perfor- 
mances in Figure 3. We can see that DCHE, EDCF and 
BOOST achieve comparable results in two subfigures. 
Specifically, they can accurately detect embedded asso- 
ciated SNPs interactions under most settings. On the 
contrary, TEAM and SNPRuler lose significant powers 
on both datasets with MAFs = 0.2 or 0.4. A com- 
mon trend to previous experimental results is that five 
methods tend to possess more powers as MAFs increase. 
After carefully examining results from five techniques, 
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Figure 2 Performance comparison of DCHE, TEAM, SNPRuler, EDCF and BOOST on disease models 1-4. Performance comparison of DCHE, 
TEAM, SNPRuler, EDCF and BOOST on four disease models for different allele frequencies and sample sizes. The red, green, blue, cyan and magenta 
bars show powers of DCHE, TEAM, SNPRuler, EDCF and BOOST, respectively. Models are ordered from top to bottom and from left to right and they 
are model 1, model 2, model 3 and model 4. 
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Figure 3 Performance comparison on 50 models withiout main effects. For each model, we simulate data using sample size 800 and 
MAF e {0.2, 0.4}. The red, green, blue, cyan and magenta boxes show powers of DCHE, TEAM, SNPRuler, EDCF and BOOST, respectively. 



we can find that DCHE apparently outperforms other 
three methods except BOOST, although the difference 
is not too much. A possible explanation is that these 
embedded models with little main effects are more 
suitable for model-based detection strategy, and DCHE 
is a model-free based method. If we adopt the same 
overall quality defined in previous paragraph to evalu- 
ate performances, q are 0.972, 0.656, 0.891, 0.951 and 
0.984 for DCHE, TEAM, SNPRuler, EDCF and BOOST, 
respectively. 

Two-locus additive models 

Since we intend to find statistically significant associations 
of ^-SNP interaction (t > 2) with phenotypes, an additive 
model is used to evaluate method for detecting additive 
effects. The additive model with 3 different settings can be 
found in Additional file 1: Table S4. The cell in the table 
is the odd-ratio of given genotype, and the odd-ratio can 
be accumulated as the presence of minor allele(s). Both 
the additive effects of two disease related alleles are equal, 
which equal to ^. For the simulation process, we adopt 
the one which is used in [18], and the value of a and p 
are determined by the prevalence p(D) and the genetic 
heritability /z^. We fix p(D) = 0.1 and = 0.03, set 
MFA = 0.1, 0.2 and 0.4 by giving = 800 and 1600. We 
simulate 100 datasets under each setting (3 pairs of a and 
P) with 1000 SNPs under 800 and 1600 samples in balance. 

Test results are illustrated in Figure S2 in Additional 
file 1 for the additive models. As we expect, DCHE, TEAM 
and EDCF show limited powers when the sample size 
is small, and the power goes up as the size of sample 
increases. SNPRuler and BOOST can not detect any pairs, 
since BOOST is designed to detect the statistic interaction 



which is absent in the additive model. Following our pre- 
ceding measurement, the overall qualities are 0.850, 0.858 
and 0.771 for our method, TEAM and EDCF, respectively. 
It is worthy to note that TEAM is time consuming com- 
pared to DCHE. Under the same computing environment, 
TEAM takes about 2 hours to finish analysis of 100 simu- 
lated datasets with 1600 samples, while it can be done in 
15 minutes by using DCHE. 

Three-locus disease models 

For comparisons on three-locus disease models, two 
methods are dropped, i.e. TEAM and BOOST, because 
both of them are designed only for detecting two-locus 
interactions. Based on settings of model 5 given in pre- 
vious sections, we get 10 groups of datasets with 100 
replicates, which can be further categorized to two fami- 
lies with N = 2000 or 4000. Note that when MAF = 0.5, 
there is no marginal effect, otherwise disease models have 
X = 0.2. Experimental results are illustrated in Figure 4. 
Similar to models 1 to 4, three programs generally tend to 
get more powers as MAF or sample size increases. Con- 
sidering parameter p in (P,MAF), we can see that powers 
go up when P goes down for all methods. A significant 
difference can be found is that SNPRuler can only obtain 
acceptable results when (P = 0,5, MAF = 0.5); otherwise 
SNPRuler hardly gets powers. Although the distinction 
between DCHE and EDCF is not too much, we can still 
observe that DCHE hits more ground- truth SNPs inter- 
actions than EDCF does at datasets with N = 4000. 
Additionally, overall qualities for DCHE and EDCF are 
0.514 and 0.52 with = 2000, and the overall qual- 
ity for DCHE rises up to 0.914 comparing with 0.866 
for EDCE 
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Figure 4 Performance comparison on the three-locus epistasis models. Performance comparison of DCHE, SNPRuler and EDCF on two 
three-locus epistasis models, model 5 and model 6, for different allele frequencies and sample sizes. The red, blue and cyan bars show powers of 
DCHE, SNPRuler and EDCF, respectively. 



For Model 6, we set MAF = 0.5 and population preva- 
lence p = 0.01 as EDCF does in [20]. Note that model 6 
with penetrance table in Additional file 1: Table S6 gets the 
maximum when p e (0, j^]. Three methods' results are 
shown in Figure 4, from which we can see that all methods 
can get nearly full powers for model 6. With considering 
the overall quality, DCHE and EDCF both reach 100% and 
SNPRuler hits to 0.965. 



Experiments on AMD data 

Age-related macular degeneration (AMD) is an acquired 
degeneration of the retina which usually affects older 
adults and results in a loss of vision in the centre of the 
visual field. Like many other chronic diseases, AMD is 
caused by a combination of genetic and environmental 
risk factors, including and not limited to macular degen- 
eration gene, too much exposure to sunlight and smoking. 
The reported AMD dataset contains genotypes of 103,611 
SNPs from 96 affected individuals and 50 controls [31]. 
Before applying DCHE on AMD dataset, the same qual- 
ity control used in [18] is applied: SNPs with more than 
10% missing data or MAF < 0.05 or p-values from Hardy- 
Weinberg Equilibrium (HWE) tests less than 0.001 are 
removed. Subsequently, 90,449 SNPs from 50 controls and 
96 cases are left in the dataset. The setting of parameters 
for DCHE is as follows: / = {10000, 4000, 2000}, /: = 2, 3, 4 
and ao = 1.5 x 10"^ 1.2 x 10"^^ 1.0 x 10-^^ for two-, 
three- and four-locus interactions detections, respectively. 

When we set ao = 1.5 x 10~^ to filter out insignificant 
interactions for two-locus epistatic interactions detect- 
ing, DCHE can hardly report any qualified modules, so 
we select top-Zc modules to conduct analysis. For t = 3 
and 4, DCHE generates more than 1000 pairs epistatic 
interactions. In order to give a straight view of results, 
we introduce two concepts, "centre SNPs" and "centre 
genes", similar to those in [32]: we arrange those SNPs 



and genes in descending manner according to their fre- 
quencies showing in top-/: interactions, and select top-5 
SNPs or genes as "centre". Based on the previous proce- 
dure. Table 1 and Table 2 give a general view of DCHE s 
results on AMD dataset with k = 1000 for t = 2, 3, 
k = 500 for ^ = 4, and 5 = 6. Names of SNPs or genes 
showed in boldface indicate their first time to appear in 
the table. As we can see, top-k-s SNPs or genes tend to 
share some common elements among different settings 

Table 1 Centre SNPs identified in top-1 000/500 SNPs 
interactions on AMD dataset 

# SNPs oer Centre SNPs from analyses of AMD dataset 



interaction Centre SNPs (Genomic position) # Interacting SNPs 



rs380390 (Chi: 196701051) 


524 


rsl 329428 (Chi: 196702810) 


253 


^ rsl 394608 (Ch5: 1 55783294) 


23 


rsl 740752 (Chi 0: 38538771 ) 


20 


rsl 363688 (Ch5: 174609731) 


11 


rsl 051 21 74 (Ch9: 88886574) 


11 


rs380390 (Chi: 196701051) 


709 


rsl 329428 (Chi : 1 9670281 0) 


106 


rsl 363688 (Ch5: 1 74609731) 

3 


63 


rs61 8499 (Chi 1 : 1 081 48839) 


47 


rsl 926489 (Chi 3: 92667989) 


35 


rs3781 868 (Ch 1 1 : 1 08059569) 


34 


rs380390 (Chi: 196701051) 


459 


rs618499 (Chi 1 : 1 08148839) 


188 


rs3781868 (Chi 1 : 108059569) 

4 


115 


rs294278 (Ch3: 31 1 2791 1 ) 


36 


rs300780 (Ch2: 110819) 


35 


rs315511 (Chi: 848491 16) 


28 


Note: SNP in boldface indicates the first time to appear in the table. 
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Table 2 Centre genes identified in top-1 000/500 SNPs 
interactions on AMD dataset 



# Genes per Centre genes from gene-only SNP analyses 



interaction 


Centre genes 


# Interacting genes 




CFH: complement factor H 


777 




ZNF25: zinc finger protein 25 


23 


2 


SGCD: sarcoglycan, 
delta (35kDa dystrophin- 
associated glycoprotein) 


23 




LRIG3: leucine-rich repeats 
and immunoglobulin-like 
domains 3 


14 




DRD1 : dopamine receptor 
Dl 


1 1 




ISCA1 : iron-sulfur cluster 
assembly 1 


1 1 




CFH: complement factor H 


815 




DRD1 : dopamine receptor Dl 


63 


3 


ATM: ataxia telangiectasia 
mutated 


47 




GPC5: glypican 5 


43 




NPAT: nuclear protein, 
ataxia-telangiectasia locus 


34 




KDM4C: lysine (K)-specific 
demethylase 4C 


25 




CFH: complement factor H 


/I cn 




ATM: ataxia telangiectasia 
mutated 


191 


4 


NPAT: nuclear protein, 
ataxia-telangiectasia locus 


115 




LRIG3: leucine-rich repeats and 
immunoglobulin-like domains 3 


73 




TGFBR2: transforming 
growth factor, beta receptor 
II (70/80kDa) 


38 




ACP1 : acid phosphatase 1, 
soluble 


35 



Note: Gene in boldface indicates tlie first time to appear in tlie table. 



of order of interactions, t. For AMD dataset, two SNPs 
(rs380390 and rsl329428) already have been reported as 
disease associated SNPs with AMD based on results from 
single allelic association tests with df = 1 [31]. In our 
results, DCHE also ranks rs380390 and rsl329428 as top 2 
centre SNPs both in two- and three-locus epistatic inter- 
actions detecting. Both rs380390 and rsl329428 locate 
inside the gene CFH whose location is lp32, and their 
protein products have an essential role in the regula- 
tion of complement activation and restricting the innate 
defence mechanism to microbial infections. In addition 
to rs380390 and rsl329428, we also find another inter- 
esting SNP, rs3781868, in the category t = 4, rs3781868 
locates in the gene NPAT with location Ilq22-q23 which 



is known to be essential for histone mRNA 3' end process- 
ing and recruiting CDK9 to replication-dependent histone 
genes. 

We also analyse results using gene-only from top- 
1000/500 SNPs subset. Top- 1000/500 SNPs are mapped to 
disease-related genes, which have been annotated on the 
HuGE Navigator database, and we get 720, 851 and 424 
genes for t = 2, 3, 4 showed in Table 3. It is obvious that 
the majority of centre genes have not yet been reported by 
HuGE as associated with the AMD disease. Applying sim- 
ilar analysis used in [32], we submit centre genes reported 
in Table 3 to the ToppGene, a candidate gene prioritiza- 
tion tool [33], to evaluate biological significances of these 
novel genes. From DCHE s results, ToppGene enriches an 
cell-cell communication pathway with the name 'REAC- 
TOME_ADHERENS_JUNCTIONS_INTERACTIONS: 
Reported in Reactome, this pathway contains 14 cen- 
tre genes, and only one gene in this pathway presents 
in HuGE Navigator database. As gene names given in 
HGNC, these genes are PVRL3, CDH18, CDHIO, CDHll, 
CDH12, CDH13, CDH2, CDH4, CDH7, CDH6, CDH9, 
CDH8, CADMl, CADM3. 

For the detection of two-locus epistatic interaction on 
AMD dataset, DCHE successfully indentifies rs380390 
and rsl329428 reported by the original paper. Comparing 
with results obtained by other existing methods, we find 
that there are some overlaps between them. For exam- 
ple, DCHE lists two pairs of SNPs with ranking 246 and 
247 (rsl394608 and rs3743175, rsl394608 and rs2828155), 
which have been identified by epiMODE applied on AMD 
dataset [26]. In addition, DCHE reports another interac- 
tion module (rsl394608 and rs6847164), whose p— value is 
more significant than the above two (p — value unadjusted — 
6.78 X 10"^^). rsl394608 resides within the intron of 
SGCD, a gene located on chromosome 5q33-34, which 
has been implicated in AMD [26]. rs6847164 resides 
within PDE5A, a gene located on chromosome 4q27. 
According to the databases of NCBI and Entrez, PDE5A 
is involved in the regulation of intracellular concentra- 
tions of cyclic nucleotides and is important for smooth 
muscle relaxation. DCHE has also detected other sig- 
nificant three-locus and four-locus interaction modules: 
(rsl0487833, rsl0495593, rsl740752) and (rs9302001, 

Table 3 The disease association of DCHE selected genes 



from gene-only SNP analyses 







Reported in HuGE 


# SNPs per 


# DCIHE genes in top 


Navigator database 


interaction 


1000 (500) SNP pairs 


# Analyzed 


#DCHE 






genes 


genes 


2 


720 




20 


3 


851 


151 


28 


4 


424 




13 



Guo era/. BMC Bioinformatics 2014, 15:102 
http://www.biomedcentral.eom/1 471 -21 05/1 5/1 02 



Page 9 of 16 



rsl0497231, rs380390, rsl940041) whose unadjusted p- 
values are 3.24 x 10"^^ and 8.28 x 10"^^, respec- 
tively. rsl0487833 locates about 0.3Mb upstream of gene 
NAMPT on chromosome 10. NAMPT encodes a protein 
which is thought to be involved in many important biolog- 
ical processes, including metabolism, stress response and 
aging, rs 10497231 resides at about 0.3Mb downstream of 
gene KCNH7 on chromosome 2. KCNH7 encodes a mem- 
ber of the potassium channel, voltage-gated, subfamily 
H related to the functions including regulating neuro- 
transmitter release, heart rate, insulin secretion, neuronal 
excitability, epithelial electrolyte transport, smooth mus- 
cle contraction, and cell volume. rs9302001 locates about 
0.4Mb upstream of gene ABCC4 on chromosome 13. The 
protein encoded by ABCC4 is a member of the superfam- 
ily of ATP-binding cassette (ABC) transporters, which is 
thought to play a role in cellular detoxification as a pump 
for its substrate, organic anions. The clustering details 
of genotype combinations of the above three interaction 
modules can be found in Additional file 1. 

Experiments on RA data 

Rheumatoid arthritis (RA) is a chronic and systemic 
autoimmune disorder, which causes that afflicted joints 
become warm, swollen, tender, stiff, and in the final stage, 
deformed. RA is believed to be a heterogeneous disease in 
which genetic factors account for 60% of disease suscep- 
tibility by rough estimation. The genome-wide RA data 
comes with 545,080 SNPs, 868 cases and 1194 controls 
and it is collected by the NARAC and provided by the 
Genetic Analysis Workshop 16. The same quality control 
for AMD dataset has also been applied to RA dataset. 
Subsequently, 487,678 SNPs from 1,194 controls and 868 
cases remained. The parameters for DCHE are set as fol- 
lows: / = {10000,4000,2000} for t = 2,3,4 and ao = 
1.5 X 10-^ 1.2 X 10-^ 1.0 X 10-21 for two-, three- and 
four-locus interaction detection, respectively. 

Based on definitions of centre SNPs, Table S9 in 
Additional file 1 gives a general view of DCHE results 
on RA dataset with k = 1000 for t = 2,3,4 and 
s = 10. From the overview of centre SNPs, we can see 
that most top-ranked SNPs are coming from chromo- 
some 6 at which the well known MHC region locates. 
Recent studies conducted by the WTCCC via single- 
locus association mapping have shown that RA are 
strongly associated with the MHC region [18]. In addi- 
tion to SNPs in chromosome 6, some other interest- 
ing SNPs located at other chromosomes are detected 
in 4-order SNPs interactions, including rs888206 and 
rsl359679. rs888206 locates in the gene MMD, monocyte 
to macrophage differentiation-associated, whose protein 
product is expressed in vitro differentiated macrophages 
but not freshly isolated monocytes. Another suggested 
alternative function of MMD is related to an ion channel 



protein in maturing macrophages. rsl359679 locates near 
the gene BRINPl on chromosome 9 with location 9q32- 
q33. According to NCBI, BRINPl is within a chromoso- 
mal region which shows loss of heterozygosity in some 
bladder cancers and it may undergo hypermethylation- 
based silencing in some bladder cancers. Table SIO in 
Additional file 1 lists a summary of top-ranked reported 
genes from HuGE Navigator database and DCHE. Simi- 
lar to results in AMD dataset, the majority of potentially 
disease-related interacting genes detected by DCHE are 
novel for RA dataset. Through analyses of top- 10 frequent 
centre genes in Table Sll of Additional file 1, we can see 
that some genes already have been established associa- 
tions with RA, including HLA, BTNL2, C6orfl0, and we 
also find some other potential genetic causal factors, like 
CCAR2, KDM4C. 

Computation time 

From a practical point of view, a key issue of detecting 
high-order epistatic interactions in GWAS is the com- 
putational efficiency. In this section, we evaluate the 
performance of the proposed parallel strategy on Win- 
dows Azure cloud platform with respect to its speed-up. 
To measure the speed-up, we keep the size of datasets 
constants and increase the number of nodes (computing 
cores) in the cloud system. Speed-up given by the larger 
system is defined by the following formula [34]: 

Speedup (p) = —- (6) 
Tp 

where p is the number of nodes (computers), Ti is the 
execution time on one node and Tp is the execution time 
on p nodes. The ideal parallel method is expected to 
demonstrate linear speed-up: a system with p computing 
nodes generates a speed-up of p. However, linear speed- 
up is only a theoretical predication because the speed-up 
is curved by both the inevitable node failures and the 
communication cost which increase as the number of 
computing nodes in cluster becomes large. 

We perform the speed-up evaluation on datasets with 
quite different sizes ranging from 10,000 to 100,000. The 
number of nodes (virtual machines provided by Windows 
Azure platform) varies from 1 to 40 with 5 as minor units. 
Figure 5 shows speed-up for these datasets with three 
settings. As results showed, the proposed cloud imple- 
mentation of DCHE has a very good performance with 
respect to the speed-up. When there are a limited num- 
ber of SNPs in the dataset, e.g. M = 10, 000, it has a lower 
speed-up curve, because dataset with such size along SNP 
dimension can be easily processed by a stand-alone ver- 
sion program alone in a couple of minutes. As the size of 
datasets increases, speed-up performs better (green and 
blue curves). Therefore, we believe our proposed parallel 
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Number of nudes 

Figure 5 Speed-up. Computing nodes are sampled from 1 to 40 with 5 as interval. Tine red, blue, cyan and grey curve show functions of speed-up 
of /M = 1 0, 000, M = 50, 000 and M = ] 00, 000 with sample size fixed to 800. 



strategy for DCHE on Windows Azure cloud comput- 
ing platform can be used to treat massive data analyses 
efficiently. 

Discussion 

Relationship between DCHE and BOOST 

Two key differences lie between DCHE and BOOST: 

• BOOST is designed to detect statistic interactions via 
log-linear models, using the test with df = 4. 
DCHE aims to identify significant associations via 
model-free method, using the test with df ranges 
from 2 to 6. 

• BOOST can be only applied to find gene-gene 
interaction within two-locus. DCHE is flexible to 
search ^-SNP interaction patterns with t >2 and has 
been implemented in cloud platform on Windows 
Azure. 

In order to prove that our method possesses the power 
to detect epistatic interaction which demonstrates signifi- 
cant interaction effect, the overlap of results from BOOST 
and DCHE on two-locus simulated models are showed in 
Additional file 1: Figure S3. The bar with full height means 
that all modules produced by BOOST are all marked 
as the most significant interaction by DCHE. For mod- 
els embedded ground-truth modules with both main and 
interaction effects, nearly all modules reported by BOOST 
are also found by DCHE. Only 3 out of 24 do not reach 
100%, but they are all above 90%. For datasets without 



main effect or with weak main effect, although 7 out of 50 
do not get 100% overlap, it is still convincing to conclude 
that significant interactions identified by BOOST can be 
mostly covered by DCHE. 

The advantages and limitations of DCHE 

The development of DCHE is triggered by the limitations 
of existing works on finding high-order epistatic inter- 
action from genome- wide data. DCHE displays several 
advantages over existing methods: 

1. DCHE detects high-order epistatic interactions from 
genome-wide data without exhaustive enumeration; 

2. DCHE is a model-free method and does not assume 
any prior distribution; 

3. DCHE does not assume any particular epistasis 
model. This is very important for real studies because 
the patterns of SNP interactions are generally 
unknown and could be very complex; 

4. DCHE provides a list of ranked interaction based on 
their significance. 

The current version of DCHE cannot distinguish which 
type of epistatic interaction contributing most to the sig- 
nificant SNP module, i.e. the statistical interaction or 
the full association with only main effects. It is a gen- 
eral problem for other existing model-free methods, like 
MDR, BEAM, TEAM, SNPRuler and EDCR Future work 
can be extended to addressing the above issue. In addi- 
tion, how to reduce false positive errors is a challenging 
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problem in GWASs for our method, although we combine 
permutation test and Bonferroni correction to control 
type I error. Incorporating the haplotype information and 
pathway information to further help reduce false positive 
errors can be another direction of our future work. 

Conclusions 

In this manuscript, a cloud based algorithm DCHE, for 
detecting high-order genome-wide epistatic interactions 
is proposed. The key step of DCHE is the dynamic 
clustering procedure, which is used to guide on how 
to merge genotype categories to a limited and vari- 
able number of groups. By dynamic clustering, DCHE 
tries to approximately categorize genotypes with similar 
genetic effects on phenotypes. The cloud implementation 
of DCHE takes advantages of cloud computing technol- 
ogy, especially the Windows Azure cloud platform with 
high level and efficient I/O operation, queue and blob 
storage, to guarantee the correctness and enable parallel 
statistic testing. Comprehensive and systematic compar- 
isons on simulated datasets shows that DCHE can obtain 
more or comparable powers for both two- and three- 
locus interaction modules detecting comparing to other 
four recently developed algorithms, i.e. TEAM, SNPRuler, 
EDCF and BOOST. Furthermore, experiments on two 
real datasets of AMD and RA demonstrate that DCHE 
discovers many novel high-order associations which are 
significantly enriched in cases and a great deal of centre 
SNPs and genes which only appear in detections of high 
order epistatic interactions. The computation time anal- 
ysis confirms that our method provides a promising way 
to accurately accelerate large genome wide association 
studies. 

Methods 

Notation 

Suppose a G WAS dataset has M diallelic SNPs and N sam- 
ples. In general, bi-allelic genetic markers use uppercase 
letters (e.g. A, 5,...) to denote major alleles and lowercase 
letters (e.g. a, b) to denote minor alleles. For encod- 
ing three genotypes, one popular way is to use {0, 1, 2} 
to represent {AA,Aa,aa}, respectively. In a GWAS case- 
control dataset, denotes the number of cases (i.e. 
disease individuals) and denotes the number of con- 
trols (i.e. normal individuals). X is utilized to indicate the 
ordered set of the M SNPs, and Xi represents the /-th 
SNP in X. MAF{Xi) denotes the minor allele frequency 
of Xi and ^ denotes the genotype of y-th individual at 
X/. For ^-locus interaction, (X/^, . . . one genotype 

combination denotes as (^/^, . . - .gi^)- 

Dynamic clustering 

An intuitive strategy to detect genome-wide epistatic 
interactions is to test differences of genotypes' frequen- 



cies for single SNP or SNPs' combinations in cases and 
controls. The contingency table in Table 4 gives an exam- 
ple for two-locus disease model, where there are 9 geno- 
type combinations and = = 800. Numbers 
within the parentheses are counted from controls. Cells 
with higher frequencies in cases are coloured by grey 
background. Some methods, like Multifactor dimension- 
ality reduction (MDR) [12] and its extensions [35], take 
the case/control ratio of each genotype combination to 
test associations between SNP combinations and disease 
status. However, the frequency cannot be a fair indica- 
tor to uncover disease related associations, because it can 
be biased by many factors, including effect size, allele 
frequency, linkage disequilibrium between markers and 
disease loci as well as sampling errors. Other recent devel- 
oped strategies use Pearsons test, exact likelihood ratio 
test and entropy-based test to examine the independence 
of observations. For the example in Table 4a, the unad- 
justed p-value from Pearsons test with 8 degrees of 
freedom is 1.724 x 10"^^ If we use Bonferroni correc- 
tion to adjust p-value, this pair can still be significant 
with threshold 0.05 for a large GWAS dataset. However, it 
will not always be the case, and some limitations, includ- 
ing uneven or insufficient samples, tiny penetrance on 
single genotype, would dramatically affect the adjusted 
p-value. Another toy example sampled from a two-locus 
multiplicative effects model (see Table 5) is shown in 
Table 4b, where = = 400. Normally, the approx- 
imation to the x^ distribution breaks down if more than 
20% expected frequencies below 5. The unadjusted p- 
value of Table 4b is 1.09 x 10~^ and the adjusted p-value 
is 0.547 if M = 1, 000, which is obviously larger than 
0.05. Another popular method, BOOST [18] which uti- 
lizes the likelihood ratio to test statistic, cannot get a 
significant result for Table 4b by setting the significant 
level to 0.1. We can observe the same result by apply- 
ing EDCF which utilizes the concept of frequent item to 
group genotype combinations and adopts the x^ test with 
df = 2, 

To address the preceding issues, we propose a dynamic 
clustering procedure. Basically, we first merge all 3^ geno- 
type cells to groups based on certain combination 
criteria, where ranges from 3 to 6. The criterion to 
combine two genotype categories is rooted from their 



Table 4 Examples for the contingency tables 







(a) 






(b) 






BB 


Bb 


bb 


BB 


Bb 


bb 


AA 


71(108) 


97(151) 


44(47) 


40(55) 


49(76) 


13(21) 


Aa 


89(138) 


184(184) 


93(55) 


43(62) 


110(103) 


49(22) 


aa 


29(43) 


113(57) 


80(17) 


16(24) 


50(27) 


30(10) 



Note: The entries in boldface are the ones with ratios greater than 
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Table 5 Two-locus Interaction multiplicative effects 





BB 


Bb 


bb 


AA 


a 


a 


a 


Aa 


a 


a(1 +(9) 


aO +(9)2 


aa 


a 


aO +(9)2 


aO +(9)4 



similar effects which associate with phenotypes. Secondly, 
we collect statistic test values on merged groups. For 
better illustration, we take Table 5 for example. Although 
there are 9 genotype categories, some have same pene- 
trances, so we can partition them into 4 groups, where 
penetrances are a,a (1 -\- 0) ,a (1 -\- 0)^ and a (1 -\- 0)^. In 
reality, it is different to predict the order of complex dis- 
ease model and its penetrance table. Therefore, we try to 
find a statistically significant evaluation of interactions in 
a stepwise manner by merging genotype categories into a 
range of number of groups and test levels of significance. 
We select the most significant one as the evaluation for t 
SNPs interaction. The full algorithm of dynamic clustering 
is as follows. 

• Step 1. For a set of SNPs, cross-tabulate genotype 
combinations of SNPs with the categories of the 
dependent variable (phenotype). 

• Step 2. Find a pair of genotype combinations whose 
2x2 sub-table is least significantly different. If this 
significance does not reach a critical value, merge the 
two combinations and consider this merger as a 
single compound combination, and repeat this step. 

• Step 3. Calculate the significant evaluation for each 
merged group pattern when categories' number is 
larger than three and less than six. Select the most 
significant one as the unadjusted p-value as the 
evaluation for the current interaction. 

In Step 2, there are several ways to measure the differ- 
ence of a 2-by-2 contingency table, like Pearsons test 
with df = 1 and phi coefficient. In our algorithm, we 
adopt Pearson's test with df = 1 to measure the dif- 
ference. Following the dynamic clustering procedure, we 
can get the most significant group pattern as 161, 129, 
110 in cases and 238, 59, 103 in controls for Table 4. Note 
that the df varies when the number of clusters changes. 
According to our setting that ranges from 3 to 6, the 
df changes from 2 to 5, respectively. Along the clustering, 
we calculate the p — value unadjusted for each clustering with 
corresponding df. The trace of merging for Table 4b is 
(0, 1, 2, 3, 6), (5, 7, 8) and (4), if cells are labelled from left 
to right, from top to bottom and start from 0. It is easy to 
output this ground-truth interaction by the combination 
of Bonferroni correction and permutation tests for con- 
trolling type-I error, (p - valueunadjusted = 1-15 x 10~^ and 



the significant level is a = 3.0 x 10 ^ with false positive 
rate nearly equals to 0.1). 

Evaluation of interactions 

The goal of DCHE is to identify ^-SNP (t > 2) 
epistatic interactions significantly associated with phe- 
notype. As stated in [7,18], epistasis can be interpreted 
as the statistical interaction or the full association. 
The evaluation of interactions used by DCHE is simi- 
lar to detect full associations in model-based methods. 
In terms of logistic regression, the epistatic interac- 
tion we are looking for may contain main effects or 
interaction effects or both. In order to detect the sig- 
nificant association between genotype and phenotype, 
we use model-free method and p-value from Pearson's 
test to indicate the significance. Since DCHE aims 
to find high-order genome-wide epistatic interaction, 
the high-order interaction module may consist of one 
or some redundant SNPs, which do not contribute to 
increase the significance. To avoid such cases, we give 
a definition of the least possible significant epistatic 
interaction. 

Definition 1. A SNPs module (X/^,X/2, . . . is con- 
sidered as the least possible significant epistatic interaction 
by giving the significant level a, if it meets the following two 
conditions: 

(1 ) the p — value of clusters of (X/^ , Xi^ , . . . , X/^) < a; 

(2) the p — value of clusters of any subset of 
(Xi^fXi2j . . .fXi^) < the p — value of clusters of 

(^iif^i2f • • • f^k)' 

Algorithm 

Since testing all high-order SNP combinations is impos- 
sible for large GWAS datasets of millions of SNPs, 
we utilize a stepwise strategy to emulate and run 
dynamic clustering on all two-locus SNPs combinations. 
As shown in a recent theoretical study [36], the possi- 
bility that a high-order (size-^) combination with strong 
differentiation shows zero differentiation in all of its 
subsets decreases dramatically when t increases (gener- 
ally it becomes impossible for t < 5). Therefore, we 
use top'lt low-order SNPs combinations which demon- 
strate some significance as candidates. For higher order, 
we add one SNP X each time to interactions and re- 
invoke the dynamic clustering procedure, until t reaches 
the defined value. We adopt same bitwise operations and 
Boolean Representation as introduced in BOOST [18] to 
collect and compress contingency tables. The details of 
the sequential algorithm is shown in Algorithm 1. The 
cloud implementation of DCHE will be elaborated in 
next section. 
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Algorithm 1: The DCHE Algorithm 



Input: AnN X (M-\-l) matrix, where N = -\- 

and the first column denotes disease statuses; t, 
critical level a and / = {fc, k} 

Output: The top-It significant ^-locus interaction with 
p - valueadjusted > Oi 

1 Read N x (Af + 1) matrix file; 

2 Boolean represent N x (Af + 1) matrix as 
(3xM) xN matrix W; 

3 Initialize an ascending list L with length as max(/); 

4 Set f = 2; 

5 for each pair ofSNPs, {XuXj), (l < / < ; < Af) do 

6 

7 
8 



Collect the contingency table Q^y; 
Set p — valuBi^j — DynamicClustering(Q,y); 
Insert p — value^ into L; 
9 end 

10 t' = t' + 1; 

11 Initialize another ascending list L' with length max (/); 

12 while <= ^ do 



13 

14 
15 
16 
17 

18 
19 
20 
21 
22 



for each interaction (Xi-^, . . . >Xi^,) in top-It' 
positions ofL do 

for each SNP Xp (l < ; < Af) do 
if; ^ {/!,...,;>} then 

Collect the contingency table Q^,.. 
Set p — valuei^^^^j^,j = 
DynamicClustering (Qi . . ,y) ; 



Insert p — valuei^^^^j^,^j into L'; 



end 



end 



end 



Clean list L, Initialise Lt', Copy top-It' elements in 
V to Ly Lt'y Clean list L^; 

23 end 

24 for each interaction (X/^, . . . ,X/^) in the top-It position 
of{L2,...,Lt} do 

ifp — valuci^^^^j^ < a then 

for subset Q of(Xi^, . . . ,X/^) do 

if Q € {L2, . . .yLt-i} and p — value ofQ > 
the p — value of(Xi^y . . .^Xi^) then 
I break; 
end 
end 

Write {{Xi^, . . . ,Xi^) ,p - valuet^^^^j^) into 
result file; 



25 
26 
27 

28 
29 
30 
31 

32 

33 end 



end 



Each column in matrix M is converted to 3 rows in 
matrix W based on Boolean Representation (line 1-2). An 
ascending list where redundancy is not allowable is initial- 
ized with size max (/). The structure of an element in L 



consists a pair of key and value that the key is SNPs combi- 
nations and value is the unadjusted p-value (line 3). DCHE 
uses bitwise operations to collect contingency tables for 
all two-locus interaction and calculates evaluations of sig- 
nificance via DynamicClustering procedure. The p-value 
will be inserted into L (line 4-9). DCHE only selects top- 
It' interactions to extend in DynamicClustering procedure 
and inserts estimated significance into another candidate 
list V. At the end of each iteration for specific t, list L gets 
cleaned and DCHE transfers top-It' elements from L' to L 
and new list (line 10-23). When t reaches the defined 
value, top-It interaction modules with p — value > a will 
be written into the result file (line 24-33). 

The time complexity of dynamic clustering is 0(3^). 
According to the theory in [36], we only need to apply 
dynamic clustering procedure for up to 4 order of SNPs 
combinations. So when t = 2, the time complexity to 
test all 2-locus interactions is O (NA/fi). Inserting an ele- 
ment into ascending list takes time O (log (max (/))). The 
total time complexity for 2-locus interaction detection 
is 0{NM^) + O (log (max (/))). When t = 3, the time 
complexity to extend all candidate 2-locus interactions to 
3-locus modules is 0(/3Af). Hence, the entire time com- 
plexity reaches to O (feAf) + O (A/M^) + O (log (max (/))), 
if the user plans to search 3-locus interaction. Similar 
time complexity analysis can be applied to higher order 
interaction detection by using our DCHE. 

Cloud implementation 

We implemented DCHE on the Windows Azure plat- 
form [37]. Due to several practical considerations of asso- 
ciation detection for GWAS, like typical GWAS datasets 
reaching up to size of gigabytes, statistic tests for all SNPs 
combinations. A Windows Azure application running in 
the cloud or in data centre can be divided into logical 
parts which are called roles in Windows Azure as shown in 
Figure 6. A role contains a specific set of codes and will be 
running on relatively independent environment. In addi- 
tion, Windows Azure applications can be easily deployed 
to a customized cloud infrastructure, even for users who 
are not HPC experts. 

Since statistic tests of all interaction are independent, 
it is suitable and easy to parallel dynamic clustering pro- 
cedure. The details of cloud framework for DCHE is 
described as follows (see Figure SI in Additional file 1). 
Windows Azure storage service come with highly efficient 
distributed file systems and two basic storage features 
using in DCHE are as follows. (1) Blob: like traditional 
file system, where files can be retrieved by its name, and 
the limit size of single blob file can be up to 50GB [37]. 
(2) Queue: An asynchronous massage passing mecha- 
nism for communication among computing nodes; an 
important feature of Queue is that messages will auto- 
matically show up again until explicitly deleted [38]. Two 
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Figure 6 Illustration of the suggested Windows Azure application model. 



issues for cloud computing are that it may consumes 
huge time in communication and it is difficult to bal- 
ance workload between nodes. Therefore, we design a 
way to split the whole dataset into several parts and 
pack up a bunch of dynamic clustering calculations. 
Detailed steps for Cloud DCHE are elaborated in next 
paragraph. 

• Step 1. The whole matrix of GWAS dataset is 
partitioned into d parts on SNPs dimension. For the 

i'th portion, it contains • • - j, where 

7 = {1, 2, . . . ,M} and p denotes the phenotype. The 
reason why we split dataset in such way is that it can 
optimally balance both workloads within and among 
partitions. 

• Step 2. Each worker role instance reads one copy of 
all partitions which are compressed using Boolean 
Representation [18]. 

• Step 3. Customized parameters used in DCHE will be 
set through web role instance, including size of the 
matrix, numbers of partitions of data, critical value, 
maximum loci interaction pattern and length of the 
ascending list. Note that the number of worker role 
instances do not need to be specified because all tasks 
are executed asynchronously and controlled by a 
unique file name as a key recognized by worker and 
master. 

• Step 4. A particular work role, named master role, is 
used to pack tasks commands into a queue and detect 
running status via emulating files in the blob. There is 
only one instance of master role, which is 
programmed to pack a pair of key (unique file name) 
and value (task orders). 



• Step 5. Worker role instances simply fetch 
commands from queues. The key factor to 
implement fault tolerant is that same undone task 
packs will show up again in queues after a user 
defined period if there are any failure. 

• Step 6. All results will be stored in ascending lists and 
written into blob, when worker role instances finish 
tasks, i.e. dynamic clustering procedures. 

• Step 7. Master role will detect which stage the 
algorithm is running on and communicate with web 
role relying on file information in blob. 

• Step 8. Web role instance is the interface to interact 
between user and DCHE by retrieving running status. 

Availability of supporting data 

DCHE and its cloud implementation code is available 
at http://www.cs.gsu.edu/~xguo9/DCHE.html Password 
for the source code of cloud implementation is dche; 
The simulated data is available on the original paper au- 
thors' websitel (http://bioinformatics.ust.hk/BOOST.html 
and http://discovery.dartmouth.edu/epistatic_data/); The 
genome-wide Rheumatoid arthritis data is provided by the 
Genetic Analysis Workshops 16 (http://www.gaworkshop. 
org/). 

Availability and requirements 

Project name: Cloud Computing for Detecting High- 
Order Genome-wide Epistatic Interaction via Dynamic 
Clustering; Project home page: http://www.cs.gsu.edu/~ 
xguo9/DCHE.html, and https://sourceforge.net/projects/ 
dche/ Operating system(s): Windows 8, Windows Azure 
Programming language: Java 1.7 or higher; C#, coded in 
Visual Studio 2012. 



Guo et al. BMC Bioinformatics 201 4, 1 5:1 02 
http://www.bionnedcentral.conn/1 471 -21 05/1 5/1 02 



Page 15 of 16 



Additional file 

/ \ 

Additional file 1: Supplementary materials. 

V J 

Competing interests 

The authors declare that they have no competing interests. 
Authors' contributions 

XG conceived the study, and wrote the manuscript with contributions from 
other authors. XG designed and implemented the algorithm, DCHE and its 
cloud version program. YM and NY performed the gene annotation, analyzed 
the data and critically read the manuscript. YP coordinated the work, 
conceived and designed the method, and made the major edits. All authors 
read and approved the final manuscript. 

Acknowledgements 

The National Institutes of Health (NOl -AR-2-2263 and ROl -AR-44422), and the 
National Arthritis Foundation provided the RA dataset. This study is supported 
by the Molecular Basis of Disease (MBD) program at Georgia State University. 



Received: 28 October 201 3 Accepted: 1 7 March 201 4 
Published: 10 April 2014 

References 

1 . Sabaa H, Cai Z, Wang Y, Goebel R, Moore S, Lin G: Whole genome 
identity-by-descent determination. J Bioinform Comput Biol 201 3, 

11(02):! 350002. 

2. He Y, Zhang Z, Peng X, Wu F, Wang J: De novo assembly methods for 
next generation sequencing data. Tsinghua Sci Technol 201 3, 
18(5):500-514. 

3. Peter K, Hunter DJ: Genetic risk prediction: are we there yet? N Engl J 
Med 2009, 360(17):! 701 -1703. 

4. He Q, Lin DY: A variable selection method for genome-wide 
association studies. Bioinformatics 201 1 , 27:1 -8. 

5. Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for 
detecting multiple loci that influence complex diseases. Nat Genet 
2005,37:413-417. 

6. Bateson W: Mendel's Principles of Heredity. Cambridge: Cambridge 
University Press; 1909. 

7. Cordell HJ: Epistasis: what it means, what it doesn't mean, and 
statistical methods to detect it in humans. Hunn Mol Genet 2002, 
11(20):2463-2468. 

8. Cai Z, Sabaa H, Wang Y, Goebel R, Wang Z, Xu J, Stothard P, Lin G: Most 
parsimonious haplotype allele sharing determination. BMC 
Bioinformatics 2009, 1 0:1 1 5. 

9. Wang Y, Cai Z, Stothard P, Moore S, Goebel R, Wang L, Lin G: Fast 
accurate missing SNP genotype local imputation. BMC Res Notes 201 2, 
5:404. 

1 0. Cheng Y, Sabaa H, Cai Z, Goebel R, Lin G: Efficient haplotype inference 
algorithms in one whole genome scan for pedigree data with 
non-genotyped founders. Acta Math AppI Sinica, English Series 2009, 
25(3):477-488. 

11. Liu W, Chen L: Community detection in disease-gene network based 

on principal component analysis. Tsinghua Sci Technol 201 3, 

18(5):454-461. 

1 2. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Pari FF, Moore JH: 
Multifactor-dimensionality reduction reveals high-order 
interactions among estrogen-metabolism genes in sporadic breast 
cancer. Am J Hum Genet 200 1 , 69:1 38-1 47. 

1 3. Nelson M, Kardia S, Ferrell R, Sing C: A combinatorial partitioning 
method to identify multilocus genotypic partitions that predict 
quantitative trait variation. Genome Res 2001 , 1 1 (3):458-470. 

1 4. Cordell HJ: Detecting gene-gene interactions that underlie human 
diseases. Nat Rev Genet 2009, 10:392-404. 

1 5. Wang Y, Liu G, Feng M, Wong L: An empirical comparison of several 
recent epistatic interaction detection methods. Bioinformatics 201 1, 
27(21):2936-2943. 



1 6. Fang G, Haznadar M, Wang W, Yu H, Steinbach M, Church TR, Getting WS, 
Van Ness B, Kumar V: High-order SNP combinations associated with 
complex diseases: efficient discovery, statistical power and 
functional interactions. PLoS ONE 2012, 7(4):e33531. 

1 7. Cattaert T, Calle ML, Dudek SM, Mahachie John JM, Van Lishout F, Urrea V, 
Ritchie MD, Van Steen K: Model-based multifactor dimensionality 
reduction for detecting epistasis in case-control data in the 
presence of noise. Ann Hum Genet 201 1 , 75:78-89. 

1 8. Wan X, Yang C, Yang Q, Xue H, Fan X, Tang NL, Yu W: BOOST: a fast 
approach to detecting gene-gene interactions in genome-wide 
case-control studies. Am J Hum Genet 201 0, 87(3):325-340. 

1 9. Wan X, Yang C, Yang Q, Xue H, Tang NLS, Yu W: Detecting two-locus 
associations allowing for interactions in genome-wide association 
studies. Bioinformatics 201 0, 26(20):251 7-2525. 

20. Xie M, Li J, Jiang T: Detecting genome-wide epistases based on the 
clustering of relatively frequent items. Bioinformatics 201 2, 
28:5-12. 

21 . Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for 
detecting gene-gene interactions in genome-wide case control 
studies. Bioinformatics 201 1, 27(9):1309-1310. 

22. Liu Y, Xu H, Chen S, Chen X, Zhang Z, Zhu Z, Qin X, Hu L, Zhu J, Zhao GP, 
Kong X: Genome-wide interaction-based association analysis 
identified multiple new susceptibility loci for common diseases. 
PLoS Genet 201 1, 7(3):el001338. 

23. Li J: A novel strategy for detecting multiple loci in Genome-Wide 
Association Studies of complex diseases, int J Bioinform Res AppI 2008, 
4(2):150-163. 

24. Wan X, Yang C, Yang Q, Xue H, Tang NL, Yu W: Predictive rule inference 
for epistatic interaction detection in genome-wide association 
studies. Bioinformatics 201 0, 26:30-37. 

25. Zhang Y, Liu JS: Bayesian inference of epistatic interactions in 

case-control studies. Nat Genet 2007, 39:1 1 67-1 1 73. 

26. Tang W, Wu X, Jiang R, Li Y: Epistatic module detection for 
case-control studies: a bayesian model with a Gibbs sampling 
strategy. PLoS Genet 2009, 5(5):el 000464. 

27. Jiang R, Tang W, Wu X, Fu W: A random forest approach to the 
detection of epistatic interactions in case-control studies. 
BMC Bioinformatics 2009, ^0{Supp\ 1):S65. 

28. Guo X, Ding X, Meng Y, Pan Y: Cloud computing for de novo 
metagenomic sequence assembly. In Bioinformatics Research and 
Applications Volume, 7875 of Lecture Notes in Computer Science. Edited by 
Cai Z, Eulenstein 0, Janies D, Schwartz D. New York: Springer Berlin 
Heidelberg; 201 3:1 85-1 98. 

29. Zhang X, Huang S, Zou F, Wang W: TEAM: efficient two-locus epistasis 
tests in human genome-wide association study. Bioinformatics 2010, 
26(12):i217— 1227. 

30. Velez DR, White BC, Motsinger AA, Bush WS, Ritchie MD, Williams SM, 
Moore JH: A balanced accuracy function for epistasis modeling in 
imbalanced datasets using multifactor dimensionality reduction. 

Genet Epidemiol 2007, 31 (4):306-31 5. 

31 . Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, 
SanGiovanni JP, Mane SM, Mayne ST, Bracken MB, Ferris FL, Ott J, 
Barnstable C, Hoh J: Complement factor H polymorphism in 
age-related macular degeneration. Science 2005, 308(5720): 
385-389. 

32. Piriyapongsa J, Ngamphiw C, Intarapanich A, Kulawonganunchai S, 
Assawamakin A, Bootchai C, Shaw P, Tongsima S: iLOCi: a SNP 
interaction prioritization technique for detecting epistasis in 
genome-wide association studies. BMC Genomics 20]2, 13(Suppl 7):S2. 

33. Chen J, Bardes EE, Aronow BJ, Jegga AG: ToppGene suite for gene list 
enrichment analysis and candidate gene prioritization. Nucleic Acids 
Res 2009, 37(suppl 2):W305— W31 1. 

34. Xu X, Jager J, Kriegel HP: A fast parallel clustering algorithm for large 
spatial databases. In High Performance Data Mining. Edited by Guo Y, 
Grossman R. New York: Springer US; 2002:263-290. 

35. Oh S, Lee J, Kwon MS, Weir B, Ha K, Park T: A novel method to 
identify high order gene-gene interactions in genome-wide 
association studies: Gene-based MDR. BMC Bioinformatics 201 2, 
13(Suppl 9):S5. 

36. Steinbach M, Yu H, Fang G, Kumar V: Using constraints to generate and 
explore higher order discriminative patterns. In Advances in 



Guo et al. BMC Bioinformatics 201 4, 1 5:1 02 



Page 16 of 16 



http://www.bionnedcentral.conn/1 471 -21 05/1 5/1 02 



Knowledge Discovery and Data Mining, Volume 6634. Edited by Huang J, 
Cao L, Srivastava J. New York: Springer Berlin Heidelberg; 201 1 :338-350. 

37. Windows Azure Blobs: Programming Blob Storage [http://go. 
microsoft.com/fwlink/?Linkld=l 53400] 

38. Windows Azure Queue - Programming Queue Storage [http://go. 
microsoft.com/fwlink/?Linkld=l 53402] 



doi:1 0.1 1 86/1 471 -21 05-1 5-1 02 

Cite this article as: Guo et al.: Cloud computing for detecting high-order 
genome-wide epistatic interaction via dynamic clustering. BMC Bioinformatics 
2014 15:102. 

V y 



Submit your next manuscript to BioMed Central 
and take full advantage of: 



• Convenient online submission 



• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



(3 BioMed Central 



